1 ABSTRACT

We have developed an automatic abstract genera- 
tion system for Japanese expository writings based 
on rhetorical structure extraction. The system first 
extracts the rhetorical structure, the compound of 
the rhetorical relations between sentences, and then 
cuts out less important parts in the extracted struc- 
ture to generate an abstract of the desired length. 
Evaluation of the generated abstracts showed that they
contain up to 74% of the most important
sentences of the original text. The system is now
utilized as a text browser for a prototypical interac- 
tive document retrieval system. 

2 INTRODUCTION 



Abstract generation is, like Machine Translation, one
of the ultimate goals of Natural Language Processing.
However, since conventional word-frequency-based
abstract generation systems (e.g. [Kuhn 58]) lack
inter-sentential or discourse-structural analysis,
they are liable to generate incoherent abstracts.
On the other hand, conventional knowledge- or
script-based abstract generation systems (e.g.
[Lehnert 80], [Fum 86]) owe their success to the
limitation of the domain, and cannot be applied to
documents with varied subjects, such as popular
scientific magazines. To realize a domain-independent
abstract generation system, a computational theory for
analyzing linguistic discourse structure and its
practical procedure must be established.

Hobbs developed a theory in which he arranged
three kinds of relationships between sentences from
the viewpoint of text coherence [Hobbs 79].

Grosz and Sidner proposed a theory which accounted
for interactions between three notions of discourse:
linguistic structure, intention, and attention
[Grosz et al.].



Litman and Allen described a model in which
a discourse structure of conversation was built by
recognizing a participant's plans [Litman et al. 87].
These theories all depend on extra-linguistic knowledge,
the accumulation of which presents a problem
in the realization of a practical analyzer.

Cohen proposed a framework for analyzing the 
structure of argumentative discourse [Cohen 87], yet
did not provide a concrete identification procedure 
for 'evidence' relationships between sentences, where 
no linguistic clues indicate the relationships. Also, 
since only relationships between successive sentences 
were considered, the scope which the relationships 
cover cannot be analyzed, even if explicit connectives 
are detected. 

Mann and Thompson proposed a linguistic structure
of text describing relationships between sentences
and their relative importance [Mann et al. 87].
However, no method for extracting the relationships
from superficial linguistic expressions was described
in their paper.

We have developed a computational model of
discourse for Japanese expository writings, and
implemented a practical procedure for extracting
discourse structure [Sumita 92]. In our model, discourse
structure is defined as the rhetorical structure, i.e., 
the compound of rhetorical relations between sen- 
tences in text. Abstract generation is realized as a 
suitable application of the extracted rhetorical struc- 
ture. In this paper we describe briefly our discourse 
model and discuss the abstract generation system 
based on it. 



3 RHETORICAL STRUCTURE 

Rhetorical structure represents relations between var- 
ious chunks of sentences in the body of each section. 
In this paper, the rhetorical structure is represented 
by two layers: intra-paragraph and inter-paragraph 
structures. An intra-paragraph structure is a struc- 
ture whose representation units are sentences, and an 
inter-paragraph structure is a structure whose rep- 
resentation units are paragraphs. 

In text, various rhetorical patterns are used to 
clarify the principle of argument. Among them, con- 
nective expressions, which state inter-sentence rela- 
tionships, are the most significant. The typical gram- 
matical categories of the connective expressions are 
connectives and sentence predicates. They can be
divided into thirty-four categories, some of which
are exemplified in Table 1.

Table 1: Example of rhetorical relations

Relation                Expressions
----------------------  ----------------------------------
serial (<SR>)           dakara (thus)
summarization (<SM>)    kekkyoku (after all)
negative (<NG>)         shikashi (but)
example (<EG>)          tatoeba (for example)
especial (<ES>)         tokuni (particularly)
reason (<RS>)           nazenara (because)
supplement (<SP>)       mochiron (of course)
background (<BI>)       juurai (hitherto)
parallel (<PA>)         mata (and)
extension (<EX>)        kore wa (this is)
rephrase (<RF>)         tsumari (that is to say)
direction (<DI>)        kokode wa ... wo noberu
                        (here ... is described)



The rhetorical relation of a sentence, which is
its relationship to the preceding part of the text,
can be extracted in accordance with the connective
expression in the sentence. A sentence without
any explicit connective expression is assigned the
extension relation. The relations exemplified
in Table 1 are used for representing the rhetorical
structure.
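
The lookup described above can be sketched as follows. This is
only an illustration under stated assumptions: the input is a
romanized sentence, the connective is matched at the start of
the sentence only, and the dictionary covers just the
single-phrase examples of Table 1. The names CONNECTIVES and
extract_relation are hypothetical and do not come from the
system itself, which works on the output of morphological and
syntactic analysis.

```python
# Connective expressions copied from Table 1 (romanized).
# Assumption: a plain string prefix match stands in for the
# morphological analysis a real system would rely on.
CONNECTIVES = {
    "dakara": "SR",     # serial
    "kekkyoku": "SM",   # summarization
    "shikashi": "NG",   # negative
    "tatoeba": "EG",    # example
    "tokuni": "ES",     # especial
    "nazenara": "RS",   # reason
    "mochiron": "SP",   # supplement
    "juurai": "BI",     # background
    "mata": "PA",       # parallel
    "kore wa": "EX",    # extension
    "tsumari": "RF",    # rephrase
}


def extract_relation(sentence: str) -> str:
    """Return the relation code for a sentence (hypothetical helper)."""
    text = sentence.strip().lower()
    for expression, relation in CONNECTIVES.items():
        if text.startswith(expression):
            return relation
    # No explicit connective expression: assign the extension relation.
    return "EX"
```

For instance, a sentence beginning with "Shikashi" would be
given the negative relation, while a sentence with no listed
connective falls back to extension.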

Fig. 1 shows a paragraph from an article titled
"A Zero-Crossing Rate Which Estimates the Frequency
of a Speech Signal," where underlined words
indicate connective expressions. Although the fourth
and fifth sentences are clearly the exemplification
of the first three sentences, the sixth is not. Rather,
the sixth sentence is the concluding sentence for the
first five. Thus, the rhetorical structure for this text
can be represented by a binary tree as shown in
Fig. 2. This structure is also represented as follows:

[[[1 <EX> 2] <ES> [3 <EG> [4 <EX> 5]]] <SR> 6] 



1: In the context of discrete-time signals, zero-crossing
   is said to occur if successive samples have different
   algebraic signs.
2: The rate at which zero crossings occur is a simple
   measure of the frequency content of a signal.
3: This is particularly true of narrow band signals.
4: For example, a sinusoidal signal of frequency F0,
   sampled at a rate Fs, has Fs/F0 samples per cycle of
   the sine wave.
5: Each cycle has two zero crossings so that the
   long-term average rate of zero-crossings is
   Z = 2F0/Fs.
6: Thus, the average zero-crossing rate gives a
   reasonable way to estimate the frequency of a
   sine wave.

(L.R.Rabiner and R.W.Schafer, Digital Processing of 
Speech Signals, Prentice-Hall, 1978, p. 127.) 



Figure 1: Text example 




Figure 2: Rhetorical structure for the text in Fig. 1

The rhetorical structure is represented by a binary
tree, by analogy with the syntactic tree of a natural
language sentence. Each sub-tree of the rhetorical
structure forms an argumentative constituent, just as
each sub-tree of the syntactic tree forms a grammatical
constituent. Also, as in a syntactic tree, a sub-tree of
the rhetorical structure is sub-categorized by the
relation of its parent node.
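
To make the binary-tree reading concrete, the bracket notation
used for the example text can be parsed into nested tuples.
The sketch below is an illustration only: parse_structure and
leaves are hypothetical helper names, and the nested
(left, relation, right) tuple is just one convenient encoding,
not the representation used inside the system.

```python
import re


def parse_structure(text):
    """Parse '[[1 <EX> 2] <SR> 3]' into (left, relation, right) tuples."""
    tokens = re.findall(r"\[|\]|<\w+>|\d+", text)
    pos = 0

    def node():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "[":
            return int(token)              # terminal node: a sentence number
        left = node()
        relation = tokens[pos].strip("<>")  # relation labelling this node
        pos += 1
        right = node()
        pos += 1                            # skip the closing ']'
        return (left, relation, right)

    return node()


def leaves(tree):
    """List the sentence numbers covered by a (sub-)tree, left to right."""
    if isinstance(tree, int):
        return [tree]
    left, _, right = tree
    return leaves(left) + leaves(right)


tree = parse_structure("[[[1 <EX> 2] <ES> [3 <EG> [4 <EX> 5]]] <SR> 6]")
```

Each tuple is one argumentative constituent. For example,
leaves(tree[0][2]) lists the sentences of the sub-tree headed
by the example relation, that is, sentences three to five.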



4 RHETORICAL STRUCTURE EXTRACTION

The rhetorical structure represents logical relations 
between sentences or blocks of sentences of each sec- 
tion of the document. A rhetorical structure analysis 
determines logical relations between sentences based 
on linguistic clues, such as connectives, anaphoric 
expressions, and idiomatic expressions in the input 
text, and then recognizes an argumentative chunk of 
sentences. 

Rhetorical structure extraction consists of five
major sub-processes:

(1) Sentence analysis accomplishes morphological
    and syntactic analysis for each sentence.

(2) Rhetorical relation extraction detects rhetorical
    relations and constructs the sequence of sentence
    identifiers and relations.

(3) Segmentation detects rhetorical expressions between
    distant sentences which define rhetorical
    structure. They are added onto the sequence
    produced in step 2, and form restrictions for
    generating structures in step 4. For example,
    expressions like "... 3 reasons. First, ... Second,
    ... Third, ..." and "... Of course, ... But, ..."
    are extracted and the structural
    constraint is added onto the sequence so as to
    form a chunk between the expressions.

(4) Candidate generation generates all possible
    rhetorical structures described by binary trees
    which do not violate segmentation restrictions.

(5) Preference judgement selects the structure candidate
    with the lowest penalty score, a value
    determined based on preference rules on every
    two neighboring relations in the candidate.
    A preference rule used in this process represents
    a heuristic local preference on consecutive
    rhetorical relations between sentences. Consider
    the sequence [P <EG> Q <SR> R], where P, Q, R are
    arbitrary (blocks of) sentences. The premise
    of R is obviously not only Q but both P and Q.
    Since the discussion in P and Q is considered to
    close locally, structure [[P <EG> Q] <SR> R]
    is preferable to [P <EG> [Q <SR> R]]. Penalty
    scores are imposed on the structure candidates
    violating the preference rules. For example,
    for the text in Fig. 1, the structure candidates
    which contain the substructure
    [3 <EG> [[4 <EX> 5] <SR> 6]], which says that
    sentence six is the entailment of sentences four
    and five only, are penalized. The authors have
    investigated all pairs of rhetorical relations and
    derived those preference rules.
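
Steps 4 and 5 can be sketched as follows, under several
assumptions. Trees are nested (left, relation, right) tuples;
relations[i] is the relation that links sentence i+2 to the
preceding text; a segmentation restriction is a half-open
index span (start, end) that must surface as one sub-tree; and
only the single <EG>/<SR> preference rule quoted above is
encoded. The function names are hypothetical, and the full
rule set derived by the authors is not reproduced here, so
this sketch leaves many candidates tied at penalty zero.

```python
def crosses(span, chunk):
    """True if two half-open index spans overlap without nesting."""
    (a, b), (c, d) = span, chunk
    return a < c < b < d or c < a < d < b


def candidates(sentences, relations, chunks=()):
    """Enumerate all binary trees that respect the chunk restrictions."""
    def build(lo, hi):
        if hi - lo == 1:
            return [sentences[lo]]
        trees = []
        for k in range(lo + 1, hi):          # split point between k-1 and k
            if any(crosses((lo, k), c) or crosses((k, hi), c) for c in chunks):
                continue
            for left in build(lo, k):
                for right in build(k, hi):
                    # The node is labelled with the relation of the first
                    # sentence of its right part.
                    trees.append((left, relations[k - 1], right))
        return trees

    return build(0, len(sentences))


def root_relation(tree):
    return tree[1] if isinstance(tree, tuple) else None


def penalty(tree):
    """Count violations of the one illustrative rule: [P <EG> [Q <SR> R]]."""
    if not isinstance(tree, tuple):
        return 0
    left, relation, right = tree
    violation = relation == "EG" and root_relation(right) == "SR"
    return int(violation) + penalty(left) + penalty(right)


sentences = [1, 2, 3, 4, 5, 6]
relations = ["EX", "ES", "EG", "EX", "SR"]   # sequence for the text in Fig. 1
trees = candidates(sentences, relations)
```

For six sentences without restrictions this enumerates 42
candidates. A candidate containing [3 <EG> [[4 <EX> 5] <SR> 6]]
receives a penalty, while the structure given for Fig. 1 does
not. Exhaustive enumeration grows quickly with the number of
sentences, which is one reason the segmentation restrictions
of step 3 are useful.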

The system analyzes inter-paragraph structures 
after the analysis of intra-paragraph structures. While 
the system uses the rhetorical relations of the first 
sentence of each paragraph for this analysis, it exe- 
cutes the same steps as it does for the intra-paragraph 
analysis. 

5 ABSTRACT GENERATION 

The system generates the abstract of each section of 
the document by examining its rhetorical structure. 
The process consists of the following two stages:

(1) Sentence evaluation 

(2) Structure reduction 

In the sentence evaluation stage, the system calculates
the importance of each sentence in the original
text based on the relative importance of rhetorical
relations. The relations are categorized into three
types as shown in Table 2. For the relations categorized
into RightNucleus, the right node is more important, from
the point of view of abstract generation, than the left
node. For the LeftNucleus relations, the reverse holds.
Both nodes of the BothNucleus relations are equal in
importance. For example, since the right node of the
serial relation (e.g., yotte (thus)) is the conclusion
of the left node, the relation is categorized into
RightNucleus, and the right node is more important
than the left node.

The actual sentence evaluation is carried out
by demerit marking. In order to determine important
text segments, the system imposes penalties
on the nodes of each rhetorical relation according
to its relative importance: it imposes a
penalty on the left node for a RightNucleus relation,
and on the right node for a LeftNucleus
relation. It adds penalties from the root node to the
terminal nodes in turn, to calculate the penalties of
all nodes.
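
The demerit marking can be sketched as a top-down traversal.
The sketch assumes that each relation contributes a penalty of
exactly one to its less important side, that trees are nested
(left, relation, right) tuples, and that the NUCLEUS mapping
lists only the relations named in Table 2. The names NUCLEUS
and sentence_penalties are hypothetical.

```python
# Relation types as listed in Table 2, keyed by the codes of Table 1.
NUCLEUS = {
    "SR": "right", "SM": "right", "NG": "right",
    "EG": "left", "RS": "left", "ES": "left", "SP": "left",
    "PA": "both", "EX": "both", "RF": "both",
}


def sentence_penalties(tree, inherited=0, scores=None):
    """Accumulate penalties from the root down to the terminal nodes."""
    scores = {} if scores is None else scores
    if isinstance(tree, int):
        scores[tree] = inherited           # terminal node: a sentence number
        return scores
    left, relation, right = tree
    kind = NUCLEUS[relation]
    # RightNucleus penalizes the left node, LeftNucleus the right node,
    # BothNucleus penalizes neither (assumed unit penalty of 1).
    sentence_penalties(left, inherited + int(kind == "right"), scores)
    sentence_penalties(right, inherited + int(kind == "left"), scores)
    return scores


# Structure of the example text: [[[1 EX 2] ES [3 EG [4 EX 5]]] SR 6]
example = (((1, "EX", 2), "ES", (3, "EG", (4, "EX", 5))), "SR", 6)
penalties = sentence_penalties(example)
# penalties == {1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 0}
```

Because a node inherits the penalties of all its ancestors, a
sentence buried inside several less important constituents
ends up with a high score and is removed first.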

Then, in the structure reduction stage, the system
recursively cuts out, starting from the terminal
nodes, the nodes on which the highest penalty is
imposed. The list of terminal nodes of the final
structure becomes an abstract for the original
document. If the abstract is still longer than the
expected length, the system cuts out terminal nodes
that have the same penalty score, starting from the
last sentences.

If the text is written loosely, the rhetorical structure
generally contains many BothNucleus relations
(e.g., parallel (mata (and, also))), and the system
cannot grade the penalties finely and cannot reduce
sentences smoothly.

After sentences of each paragraph are reduced, 
inter-paragraph structure reduction is carried out in 
the same way based on the relative importance judge- 
ment on the inter-paragraph rhetorical structure. 

If the penalty calculation mentioned above is
carried out for the rhetorical structure shown in
Fig. 2, each penalty score is calculated as shown in
Fig. 3. In Fig. 3, italic numbers are the penalties the
system imposed on each node of the structure, and
broken lines are the boundaries between nodes with
different penalty scores. The figure shows that
sentences four and five have penalty score three, that
sentence three has two, that sentences one and two
have one, and that sentence six has no penalty.
In this case, the system selects sentences one, two,
three and six for the longest abstract; it
could also select sentences one, two and six as a
shorter abstract, or sentence six alone as a still
shorter abstract.
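
The selection just described can be reproduced with a small
sketch. It works directly on the per-sentence penalty scores
rather than on the tree, which is a simplification of the
recursive node cutting, and the function names are
hypothetical. Ties are broken by dropping the later sentence
first, following the rule stated for abstracts that exceed the
expected length.

```python
def abstracts_by_threshold(penalties):
    """Return sentence selections from the longest to the shortest."""
    levels = sorted(set(penalties.values()), reverse=True)
    results = []
    for limit in levels[1:]:               # drop the highest level each time
        results.append(sorted(s for s, p in penalties.items() if p <= limit))
    return results


def reduce_to_length(penalties, max_sentences):
    """Remove highest-penalty sentences until the abstract is short enough."""
    kept = sorted(penalties)
    while len(kept) > max_sentences:
        # Highest penalty first; among equal penalties, the last sentence.
        worst = max(kept, key=lambda s: (penalties[s], s))
        kept.remove(worst)
    return kept


# Penalty scores reported for the example structure.
scores = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 0}
selections = abstracts_by_threshold(scores)
# selections == [[1, 2, 3, 6], [1, 2, 6], [6]]
```

The three selections correspond to the longest, shorter and
shortest abstracts discussed above.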

After the sentences to be included in the ab- 
stract are determined, the system alternately arranges 
the sentences and the connectives from which the re- 
lations were extracted, and realizes the text of the 
abstract. 

The important feature of the generated abstracts
is that, since they are composed of rhetorically
consistent units which consist of several sentences
and form a rhetorical substructure, they do
not contain fragmentary sentences which cannot be
understood alone. For example, in the abstract
generation mentioned above, sentence two never appears
alone in the abstract, but always appears with
sentence one. If sentence two appeared alone in the
abstract without sentence one, it would be difficult
to understand the text.

6 EVALUATION 

The generated abstracts were evaluated from the point
of view of key sentence coverage. 30 editorial articles
of "Asahi Shinbun", a Japanese newspaper, and 42
technical papers of "Toshiba Review", a journal of
Toshiba Corp. which publishes short expository papers
of three or four pages, were selected, and three
subjects judged the key sentences and the most
important key sentence of each text. For the editorial
articles, the average correspondence rates of the
key sentences and the most important key sentence
among the subjects were 60% and 60% respectively.
For the technical papers, they were 60% and 80%
respectively.

Table 2: Relative importance of rhetorical relations

Relation Type   Relation                                  Import. Node
--------------  ----------------------------------------  ------------
RightNucleus    serial, summarization, negative, . . .    right node
LeftNucleus     example, reason, especial, supplement     left node
BothNucleus     parallel, extension, rephrase, . . .      both nodes

Figure 3: Penalties on relative importance for the
rhetorical structure in Fig. 2

Then the abstracts were generated and were
compared with the selected key sentences. The result
is shown in Table 3. For the technical papers,
the average length ratio (abstract/original) was 24%,
and the coverage of the key sentences and the most
important key sentence were 51% and 74% respectively.
For the editorials, the average
length ratio (abstract/original) was 30%, and the
coverage of the key sentences and the most important
key sentence were 41% and 60% respectively.
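
The two measures can be illustrated for a single text as
follows. The sketch assumes that lengths are counted in
sentences and that coverage is the fraction of judged key
sentences that appear in the abstract; the text does not state
how either was computed, so both definitions are assumptions.
The data below are invented for illustration and are not
taken from the evaluation.

```python
def length_ratio(abstract, original):
    """Abstract length over original length, counted in sentences."""
    return len(abstract) / len(original)


def coverage(abstract, key_sentences):
    """Fraction of the judged key sentences that the abstract contains."""
    return len(set(abstract) & set(key_sentences)) / len(key_sentences)


# Hypothetical text of ten sentences (sentence numbers only).
original = list(range(1, 11))
abstract = [1, 2, 6]
key_sentences = [2, 6, 9, 10]

ratio = length_ratio(abstract, original)     # 0.3
covered = coverage(abstract, key_sentences)  # 0.5
```

Averaging these per-text values over a collection would give
figures of the same kind as those reported above.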

The compression rate and the key sentence coverage
of the technical papers were higher than those of the
editorials, and the reason is considered to be as
follows. Being expository, the technical papers
generally contain many rhetorical expressions.
That is, they provide many linguistic clues and the
system can extract the rhetorical structure accurately.
Accordingly, the structure can be reduced further
and the abstract becomes shorter, without
omitting key sentences. On the other hand, in the
editorials most of the relations between sentences are
supposed to be understood semantically, and are not
expressed rhetorically. Therefore, they lack linguistic
clues and the system cannot extract the rhetorical
structure accurately.

Table 3: Key sentence coverage of the abstracts

                                 total  length  cover ratio
Material                         num.   ratio   key       most important
                                                sentence  sentence
-------------------------------  -----  ------  --------  --------------
editorial (Asahi Shinbun)        30     0.3     0.41      0.60
tech. journal (Toshiba Review)   42     0.24    0.51      0.74

generation systems (e.g. [Lehnert 80], [Fum 86]), the
rhetorical structure extraction does not need prepared
knowledge or scripts related to the original
text, and can be used for texts of any domain, so
long as they contain enough rhetorical expressions
to be expository writings. Fourth, the generated
abstract is composed of rhetorically consistent units
which consist of several sentences and form a
rhetorical substructure, so the abstract does not contain
fragmentary sentences which cannot be understood
alone.

The limitations of the system are mainly due
to errors in the rhetorical structure analysis and to
the sentence-selection-type abstract generation. The
evaluation of the accuracy of the rhetorical structure
analysis carried out previously ([Sumita 92]) showed
74%. Also, to make the abstract shorter,
it is necessary to utilize an inner-sentence analysis
and to realize a phrase-selection-type abstract
generation based on it. Anaphora resolution and
topic supplementation must also be realized in
the analysis.

The system is now utilized as a text browser for 
a prototypical interactive document retrieval system. 